Project 1 - Inverted Indexer

Assigned: 1/22/08 Due: 1/29/08 at 4:30 PM

Review:

An inverted index is a mapping of words to their location in a set of documents. Most modern search engines utilize some form of an inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table which maps words in the documents to some sort of document identifier. For example, if given the following 2 documents:

We could construct the following inverted file index:

Data Set:

As your final data set, you will use a crawl of wikipedia articles. The data set is located on the DFS at /shared/wikipedia (roughly 12 GB of text in 193 files each sized roughly 64MB). Each wikipedia article is on its own line. A typical line of the data set (article) will start as follows:

<page> <title>Critical theory</title>

<text xml:space="preserve">In the [[humanities]] and [[social sciences]], '''critical theory''' has two quite different meanings with different origins and histories, one originating in [[social theory]] and the other in [[literary criticism]]

Notice that the title of the article is delimited by an XML title tag, and the text of the article is delimited by an XML text tag. The links in the article to other wikipedia articles are delimited by "[[" and "]]". Also, certain links will look like this: "[[Longer Real Name|Short Name]]". These links appear in the article with "Short Name" but link to an article titled "Longer Real Name".

Part I:

Some words are so common that their presence in an inverted index is "noise" -- they can obfuscate the more interesting properties of a document. In addition to this, the data set is full of markup, punctuation, and variations in capitalization. For the first part of this project, run a word count over an interesting corpus (e.g. Shakespeare or wikipedia) to identify any noise words. Your goal is to remove these words from your final output as well as "scrub" punctuation and XML markup and normalize the words. One strategy to accomplish this might be multiple map-reduce passes, but you are free to do this however you want. Lastly, include any assumptions you made about punctuation removal/normalization of words in your write-up, extra karma for creating language-independent scrubbing algorithm!

Part II:

For this portion of the assignment, you will design a MapReduce-based algorithm to calculate the inverted index over the data. Your final inverted index should not contain the words identified in part I of this project. Also, the format of your MapReduce output (i.e. the inverted index) must be simple enough to be machine-parseable; it is not impossible to imagine your index being one of many data structures used in a search engine's indexing pipeline. Lastly, your submitted indexer should be able to run successfully on the wikipedia corpus. "Successfully" means it should run to completion without errors or exceptions, and generate the correct word->title mapping.

Part III:

In addition to the requirements detailed above, implement at least one extension of your choice. The extensions may be as simple or as complex as you'd like; the primarily goal is to give you a chance to play with the MapReduce system. We have provided some sample extensions that you are free to choose from:

Write a query program on top of your inverted file index, which will accept a user-specified word (or phrase!) and return the IDs of the documents which contain that word.
Instead of creating an inverted file index (which maps words to their document ID), create a full inverted index (which maps words to their document ID + position in the document). How do you specify a word's position in the document?
If you created a full inverted index, write a query program on top of the index which returns not only the document IDs but also a text "snippet" from each document showing where the query term appears in the document.

Write-up:

How long do you think this project would take to finish?
How much time did you actually spend on this project?
What noisy words did you find as a result of your WordCount? By what criteria did you decide where to draw the line? Were you surprised by any of the words?
Which extensions did you do? Please detail any interesting results you found.
Acknowledge any outside assistance you received from anyone except assigned course readings and the course staff.

Hints:

When writing your mapreduce code, test it on small data sets first, this will allow you to spot problems much quicker. One suggestion to obtain a smaller data set, is to copy over the first few lines of one of the 193 parts of the wikipedia data and use that as a test data set.
You can change how data is handled when read into your mapper. By default the input configuration is "TextInputFormat" which breaks the input into lines and each mapper gets the line as a value and its offset in the document as the key. To change it from the default, the relevant function is "setInputFormat" on your job configuration in the map reduce driver. Specifically, consider the "KeyValueTextInputFormat"
You can also control the sorting that happens during the shuffle phase of the map reduce. Normally all output from the mapper is considered text and is sorted lexographically by key, however you can change the type of key the mapper outputs. The relevant function is "setMapOutputKeyClass" on your job configuration in the map reduce driver. Specifically consider "IntWritable"
You can pass parameters to your mappers - for example if there is some small ammount of data loaded at runtime that you want to be visible to all of your mappers. See this link for details.
The Hadoop API documentation is found at the hadoop website - It can be very useful. Quick Link.
Read the Cluster Notes and use the Job Tracker!

Turn-In instructions:

Please create an archive (preferably zip) that contains:

All of your sourcecode (mapreduce and otherwise if any) that you wrote for your project.
Your write up in text, doc, or pdf

Do not include any of your output, gigabytes of text, grandma's meat loaf recipe, etc.
Turn the archive in to the catalyst drop box here before 4:30 on 1/29/08